| Variable | Description |
|---|---|
| Address | Restaurant's address |
| Latitude | Latitude of the restaurant |
| Longitude | Longitude of the restaurant |
| Postal Code | Postal code (Canton of Geneva) |
| Number_of_reviews | Number of reviews |
| rating | Average rating of the restaurant |
| photoCount | Number of photos given in the reviews |
| PriceRange | Menu price range |
| Cuisines | Different type of cuisine proposed |
| OpenHours1 | Morning opening hour |
| CloseHours1 | Morning closing hour |
| OpenHours2 | Afternoon opening hour |
| CloseHours2 | Afternoon closing hour |
| description | Restaurant's description |
| Features | Features offered |
| mealTypes | Different meal type |
| TrainStation Latitude | TrainStation latitude |
| TrainStation Longitude | TrainStation longitude |
| rankingPosition | Tripadvisor ranking |
What Influences the Number of Reviews and the Rating of Restaurants on TripAdvisor in Geneva
1. Introduction
1.1 Motivation
In today’s data-driven world, businesses and organizations face the ever-growing challenge of understanding the factors that influence customer behavior and satisfaction. Customer reviews play an important role in shaping the success of products and services, as they provide direct insight into consumer preferences and feedback. This project addresses the need for predictive modeling of how measurable variables affect the number of reviews. By quantifying the relationship between factors such as price, restaurant features, opening hours, and distances to points of interest, we can unlock valuable insights for data-informed decisions. Ultimately, this analysis is motivated by the desire to help restaurants enhance customer engagement, optimize their offerings, and drive growth by harnessing the power of data analytics.
Switzerland is known for its high level of service excellence: whether in hotels, restaurants, or other service-oriented businesses, there is a strong emphasis on attentive, courteous, and efficient service. This led us to Geneva.
Its hospitality industry serves a diverse and often international clientele, creating a cosmopolitan and inclusive atmosphere. This allows us to explore whether certain types of restaurants are more popular among its population, and whether customer preferences vary substantially.
The Canton of Geneva comprises several municipalities. We selected only those directly dependent on the city of Geneva and, as a first step, work with them individually.
1.2 Research Questions
What variables influence the number of reviews and the ratings left on TripAdvisor in Geneva’s restaurant industry?
1.3 Exploratory Questions
- Which cuisine is the most popular?
- What are the most recurrent features in restaurant descriptions?
- On average, how many reviews and ratings does a restaurant have?
- Average opening hours…
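The third question can be answered with simple column means. A minimal sketch on a toy stand-in for `df3` (the real data frame and its column names come from the tables above; the numbers here are illustrative):

```r
# Toy stand-in for df3 with only the two columns needed here
df3_toy <- data.frame(
  Number_of_reviews = c(120, 45, 300, 12),
  rating = c(4.5, 4.0, 4.5, 3.5)
)

# Average number of reviews and average rating per restaurant
avg_reviews <- mean(df3_toy$Number_of_reviews)
avg_rating  <- mean(df3_toy$rating)
cat("Average number of reviews:", avg_reviews, "\n")
cat("Average rating:", avg_rating, "\n")
```

On the real data, `mean(df3$Number_of_reviews)` and `mean(df3$rating)` give the figures directly.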
1.4 Data Presentation
Source: https://www.tripadvisor.com/Restaurants-g188057-Geneva.html. Our first and largest database comes from Tripadvisor. It provides information on each restaurant’s address, coordinates and postal code, number of reviews, rating, type of cuisine, etc.
1st Data Frame: Most relevant variables
2nd Data Frame: Parking’s coordinates
| Variable | Description |
|---|---|
| Name | Parking Name |
| Address | Parking address |
| Latitude | Parking latitude |
| Longitude | Parking longitude |
| Postalcode | Postal code |
3rd Data Frame: Public transport stop coordinates
| Variable | Description |
|---|---|
| Name | Public stop Name |
| Address | Public stop address |
| Latitude | Public stop latitude |
| Longitude | Public stop longitude |
| Postalcode | Postal code |
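The various `Distance_*` variables used later are presumably derived from these coordinate pairs. A minimal sketch of how such a distance can be computed with the haversine great-circle formula (the coordinates below are illustrative, not taken from the data):

```r
# Haversine great-circle distance in metres between two (lat, lon) points
haversine <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Example: a point near Geneva's centre vs. a point near Cornavin station
d <- haversine(46.2044, 6.1432, 46.2101, 6.1423)
round(d)  # distance in metres
```

Applying this function between each restaurant and each parking or stop, then taking the minimum, yields variables like `Distance_nearestparking`.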
Code
df3$OpenedHours <- df3$OpenedHours1 + df3$OpenedHours2
df3$OpenedHours1 <- NULL
df3$OpenedHours2 <- NULL
2. Geneva Restaurants
2.1 Geneva Map
Code
shapefile_data <- st_read(here::here("Data/Canton_Genève.shp"), quiet = TRUE)
# Extract the geometry information
geometry <- st_geometry(shapefile_data)
# Create a data frame without the geometry column
attributes_data <- st_drop_geometry(shapefile_data)
# Combine the geometry and attributes into a simple features data frame
sf_data <- st_sf(attributes_data, geometry = geometry)
shapefile_data <- st_transform(sf_data, crs = st_crs("+proj=longlat +datum=WGS84"))
Ge <- shapefile_data %>% filter(COMMUNE == 'Genève' | COMMUNE == 'Carouge (GE)')
##all together
map1 <- leaflet(shapefile_data) %>%
addTiles() %>%
addPolygons(fillColor = "blue", fillOpacity = 0.5, color = "white", weight = 1, label = ~COMMUNE) %>%
addPolygons(data = Ge, fillColor = "red", fillOpacity = 0.7, color = "white", weight = 2, label = ~COMMUNE)
map1
We decided to also include the Carouge district because many of the restaurants in our database are located there.
The map below gives an idea of the location of each restaurant in Geneva:
Code
geo_cols <- c("latitude", "longitude", "address")
geo_df <- df3[, geo_cols]
geneva_coords <- c(46.2044, 6.1432)
# Create a leaflet map
map <- leaflet(geo_df) %>%
addTiles() %>%
addMarkers(
clusterOptions = markerClusterOptions(),
popup = ~as.character(address)
) %>%
setView(lng = geneva_coords[2], lat = geneva_coords[1], zoom = 13)
map
2.2 Number of restaurants by Postalcode
Code
total_restaurants <- df3 %>%
dplyr::group_by(Postalcode) %>%
dplyr::summarize(TotalRestaurants = n())
barplot2 <- total_restaurants %>%
plot_ly(x = ~Postalcode,
y = ~TotalRestaurants,
color = ~Postalcode, # Use Postalcode as color variable
colors = brewer.pal(9, "Set3"), # Use Set3 palette with 9 colors
type = "bar",
name = ~Postalcode) %>%
layout(title = "Number of restaurants by Postalcode") %>%
layout(xaxis = list(title = "Postalcode", showgrid = FALSE))
barplot2
2.3 Frequency of cuisine type & meal type
Code
cuisinetext <- df3$Cuisines %>%
tolower() %>%
str_replace_all("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "") %>%
str_replace_all("[[:punct:]]+", "") %>%
str_replace_all("[[:digit:]]+", "") %>%
str_trim() %>%
str_replace_all("\\s+", " ") %>%
tm::removeWords(stopwords("en"))
cuisinetext1 <- cuisinetext %>%
tm::VectorSource() %>%
tm::Corpus() %>%
tm::TermDocumentMatrix()
tm::inspect(cuisinetext1)
tag_cuisine <- cuisinetext %>%
tokens() %>%
quanteda::dfm(., verbose = FALSE)
tidy_df <- tidytext::tidy(tag_cuisine)
tf_idf <- tidy_df %>%
tidytext::bind_tf_idf(term, document, count) %>%
arrange(desc(tf_idf))
tidy_words <- df3 %>%
tidytext::unnest_tokens(word, Cuisines) %>%
mutate(word = SnowballC::wordStem(word)) %>%
dplyr::select(word) %>%
plyr::count() %>%
arrange(desc(freq))
reject_words <- c("option", "brew", "barbecu", "grill", "soup", "friendli", "intern")
# Remove unwanted words from the tidy_words data frame
filtered_words <- anti_join(tidy_words, data.frame(word = reject_words), by = "word")
wordcloud(words = filtered_words$word, freq = filtered_words$freq, min.freq=5, scale=c(3,0.5), colors=brewer.pal(8, "Dark2"))
Code
#wordcloud with the column Features
featurestext <- df3$Features %>%
tolower() %>%
str_replace_all("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "") %>%
str_replace_all("[[:punct:]]+", "") %>%
str_replace_all("[[:digit:]]+", "") %>%
str_trim() %>%
str_replace_all("\\s+", " ") %>%
tm::removeWords(stopwords("en"))
featurestext1 <- featurestext %>%
tm::VectorSource() %>%
tm::Corpus() %>%
tm::TermDocumentMatrix()
tm::inspect(featurestext1)
tag_features <- featurestext %>%
tokens() %>%
quanteda::dfm(., verbose = FALSE)
tidy_df1 <- tidytext::tidy(tag_features)
tf_idf2 <- tidy_df1 %>%
tidytext::bind_tf_idf(term, document, count) %>%
arrange(desc(tf_idf))
tidy_words <- df3 %>%
tidytext::unnest_tokens(word, Features) %>%
mutate(word = SnowballC::wordStem(word)) %>%
dplyr::select(word) %>%
plyr::count() %>%
arrange(desc(freq))
wordcloud(words = tidy_words$word, freq = tidy_words$freq, min.freq=5, scale=c(3,0.5), colors=brewer.pal(8, "Dark2"))
2.4 Distribution of the number of reviews and rating
Code
ggplot(df3, aes(x = Number_of_reviews)) +
geom_histogram(binwidth = 50, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Number of Reviews",
x = "Number of Reviews",
y = "Frequency")
Code
ggplot(df3, aes(x = rating)) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
labs(title = "Distribution of Ratings",
x = "Rating",
y = "Frequency")
2.5 Distribution of cuisine types
Code
cuisine_cols <- c("French", "Italian", "European", "Vegetarian", "Vegan",
"Mediterranean", "Asian", "Gluten_free", "Spanish", "Swiss")
cuisine_df <- df3[, cuisine_cols]
# Melt the dataframe
melted_df <- melt(cuisine_df)
# Filter for rows where value is 1
filtered_df <- melted_df[melted_df$value == 1, ]
# Check if there are rows to plot
if (nrow(filtered_df) > 0) {
# Create a bar plot for cuisine distribution
ggplot(filtered_df, aes(x = variable, fill = factor(value))) +
geom_bar(stat = "count", position = "dodge") +
labs(title = "Main type of Cuisines Distribution",
x = "Type of Cuisines",
y = "Count") +
scale_fill_manual(values = c("1" = "salmon"), guide = "none") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
} else {
print("No data to plot.")
}
2.6 Correlation Matrix
Code
numeric_cols <- c("Number_of_reviews", "rating", "minPrice", "maxPrice", "rankingPosition",
"Distance_neareststop", "Distance_nearestparking","Distance_to_trainstation",
"averaged_score_competition"
)
numeric_df3 <- df3[, numeric_cols]
cor_matrix <- cor(numeric_df3)
#ggcorrplot(cor_matrix,
#hc.order = TRUE,
#type = "upper", # Type of plot: "full", "lower", or "upper"
#outline.color = "white",
#colors = c("#007000", "#FFBF00", "#AC0C0C"),
#lab_size = 2,
#lab = TRUE,
#ggtheme = theme_minimal())
my_colors <- colorRampPalette(c("#007000", "#FFBF00", "#AC0C0C"))(100)
corrplot(cor_matrix, method = "color", col = my_colors)
Code
cuisine_cols <- c("Number_of_reviews", "rating", "French", "Italian", "European", "Vegetarian", "Vegan", "Mediterranean", "Asian", "Gluten_free", "Spanish", "Swiss" )
cuisine_df3 <- df3[, cuisine_cols]
cor_matrix1 <- cor(cuisine_df3)
corrplot(cor_matrix1, method = "color", col = my_colors)
Code
mealtype_cols <- c("Number_of_reviews", "rating", "Lunch", "Drinks", "Brunch", "Breakfast", "Dinner", "Late_Night_Drinks")
mealtype_df3 <- df3[, mealtype_cols]
cor_matrix3 <- cor(mealtype_df3)
corrplot(cor_matrix3, method = "color", col = my_colors)
2.7 Additional graphs
Code
selected_cols <- c("rating", "Number_of_reviews", "mealTypes", "Cuisines")
selected_df <- df3[, selected_cols]
# Split the "Mealtype" column into separate rows
selected_df <- selected_df %>%
separate_rows(mealTypes, sep = " ")
# Create box plots for rating by meal type
ggplot(selected_df, aes(x = mealTypes, y = rating, fill = mealTypes)) +
geom_boxplot() +
labs(title = "Box Plot of Rating by Meal Type",
x = "Meal Type",
y = "Rating") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot shows that ratings are consistently high across meal types, with median values around 4.5, indicating general customer satisfaction. The interquartile ranges are narrow, showing low variability in ratings within each meal type, and the outliers for some meals suggest occasional deviations from typical ratings. Overall, there is no notable difference in the central tendency of ratings among meal types, implying a uniform quality of experience.
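The medians behind this comparison can be checked numerically rather than read off the boxplot. A minimal sketch on toy long-format data (in practice `selected_df` from the chunk above would be used):

```r
# Toy long-format data: one row per (restaurant, meal type) pair
toy <- data.frame(
  mealTypes = c("Lunch", "Lunch", "Dinner", "Dinner", "Brunch"),
  rating    = c(4.5, 4.0, 4.5, 5.0, 4.5)
)

# Median rating per meal type, mirroring what each box in the plot summarises
medians <- aggregate(rating ~ mealTypes, data = toy, FUN = median)
medians
```

On the real data, `aggregate(rating ~ mealTypes, data = selected_df, FUN = median)` would report the exact medians per meal type.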
Code
ggplot(selected_df, aes(x = `Number_of_reviews`, y = rating)) +
geom_point(alpha = 0.6) +
labs(title = "Scatter Plot of Number of Reviews vs. Rating",
x = "Number of Reviews",
y = "Rating") +
theme_minimal()
Code
df3$OpenedHours <- df3$OpenedHours1 + df3$OpenedHours2
df3$OpenedHours1 <- NULL
df3$OpenedHours2 <- NULL
3. Analysis
Code
###Selection of specific columns for the analysis
Bigdata <- df3 %>% dplyr::select(-address, -latitude, -longitude, -Postalcode, -minPrice, -maxPrice, -Cuisines, -OpenHours1, -CloseHours1, -OpenHours2, -CloseHours2, -City, -description, -Features, -mealTypes, -Trainstation_latitude, -Trainstation_longitude, -rankingString)
3.1 Relation between the individual variables
Code
variables_to_show <- c("Distance_to_trainstation", "Distance_nearestparking","Distance_neareststop",
"Distance_to_jet","Distance_to_catedral","Distance_to_patekmuseum",
"Distance_to_botanicgarden", "Distance_to_nationpalace", "Distance_to_brokenchair",
"Number_of_reviews", "rating", "averaged_price")
plot_matrix <- pairs(Bigdata[,variables_to_show], col = "blue", pch = 16)
Code
df3 %>%
ggplot(aes(log(rankingPosition + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("ranking")+
ylab("Number of reviews")+
theme_minimal()
The scatter plot shows a non-linear relationship between ranking and the number of reviews, with a peak in review volume at mid-level rankings and fewer reviews at the extremes. The confidence interval indicates greater prediction uncertainty at the lowest and highest rankings.
Code
df3 %>% filter(rating > 3) %>%
ggplot(aes(rating,log(Number_of_reviews+1))) +
geom_point()+
geom_smooth()+
xlab("rating")+
ylab("Number of reviews")+
theme_minimal()
The scatter plot shows that the number of reviews peaks around a rating of 4.0, diminishes towards 4.5, and then rises slightly again at a perfect rating of 5.0. The confidence interval widens as ratings approach the extremes, indicating more variability in the number of reviews for very high and low ratings.
Code
df3 %>% filter(rating > 3) %>%
ggplot(aes(rating,log(rankingPosition +1))) +
geom_point()+
geom_smooth()+
xlab("rating")+
ylab("Ranking Position")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(photoCount + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Photo Count")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>% filter(averaged_price < 300) %>%
ggplot(aes(averaged_price,Number_of_reviews)) +
geom_point()+
geom_smooth()+
xlab("averaged price")+
ylab("Number of reviews")+
theme_minimal()
The scatter plot indicates that the number of reviews tends to be higher for restaurants with lower average prices and decreases as the average price increases. The fitted line shows a slight negative trend, and the confidence interval widens with increasing price, suggesting less certainty about the number of reviews for higher-priced restaurants.
Code
df3 %>%
ggplot(aes(log(Distance_to_trainstation + 1), log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to train station")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>% filter(Distance_nearestparking < 5000) %>%
ggplot(aes(log(Distance_nearestparking + 1), log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to nearest parking")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_catedral + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to cathedral")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_neareststop + 1),log(Number_of_reviews + 1 ))) +
geom_point()+
geom_smooth()+
xlab("Distance to nearest stop")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_jet + 1), log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to Jet d'Eau")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_patekmuseum + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to Patek museum")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_nationpalace + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to Palais des Nations")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_brokenchair + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to Broken Chair")+
ylab("Number of reviews")+
theme_minimal()
Code
df3 %>%
ggplot(aes(log(Distance_to_botanicgarden + 1),log(Number_of_reviews + 1))) +
geom_point()+
geom_smooth()+
xlab("Distance to botanic garden")+
ylab("Number of reviews")+
theme_minimal()
Code
##Creation top 100 restaurants based on rating
sorted_df <- Bigdata[order(Bigdata$rating, decreasing = TRUE), ]
top_100_restaurants_rating <- head(sorted_df, 100)
##Creation worst 100 restaurants based on rating
sorted_df1 <- Bigdata[order(Bigdata$rating), ]
worst_100_restaurants_rating <- head(sorted_df1, 100)
averages_df <- data.frame(
Category = c("Best", "Worst"),
AvgDistParking = c(mean(top_100_restaurants_rating$Distance_nearestparking),
mean(worst_100_restaurants_rating$Distance_nearestparking)),
AvgDistToStop = c(mean(top_100_restaurants_rating$Distance_neareststop),
mean(worst_100_restaurants_rating$Distance_neareststop)),
AvgDistToTrain = c(mean(top_100_restaurants_rating$Distance_to_trainstation),
mean(worst_100_restaurants_rating$Distance_to_trainstation)),
AvgDistToJet = c(mean(top_100_restaurants_rating$Distance_to_jet),
mean(worst_100_restaurants_rating$Distance_to_jet)),
AvgDistToCatedral = c(mean(top_100_restaurants_rating$Distance_to_catedral),
mean(worst_100_restaurants_rating$Distance_to_catedral)),
AvgDistToPatek = c(mean(top_100_restaurants_rating$Distance_to_patekmuseum),
mean(worst_100_restaurants_rating$Distance_to_patekmuseum)),
AvgDistToBotanic = c(mean(top_100_restaurants_rating$Distance_to_botanicgarden),
mean(worst_100_restaurants_rating$Distance_to_botanicgarden)),
AvgDistToONU = c(mean(top_100_restaurants_rating$Distance_to_nationpalace),
mean(worst_100_restaurants_rating$Distance_to_nationpalace)),
AvgDistToBrokenchair = c(mean(top_100_restaurants_rating$Distance_to_brokenchair),
mean(worst_100_restaurants_rating$Distance_to_brokenchair))
)
averages_long <- tidyr::gather(averages_df, key = "DistanceType", value = "AverageDistance", -Category)
##barplot with the distance but the restaurant are ranked according to their rating
bar_plot <- ggplot(averages_long, aes(x = DistanceType, y = AverageDistance, fill = Category)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Average Distance to Location - Best vs Worst Restaurants",
x = "Distance Type",
y = "Average Distance") +
scale_fill_manual(values = c("Best" = "green", "Worst" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(bar_plot)
We wanted to know whether there is a significant difference in the distance to the nearest parking, public transport stop, or train station between the best and worst restaurants. As the chart shows, there is no notable difference.
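"Significant" is read off the bar chart here; a formal check could use a two-sample Welch t-test on one of the distance variables for the two groups. A sketch on simulated data (in practice `top_100_restaurants_rating` and `worst_100_restaurants_rating` from the chunk above would supply the real inputs):

```r
set.seed(42)
# Simulated stand-ins for the two groups' distances to the nearest parking (metres)
best_dist  <- rnorm(100, mean = 400, sd = 150)
worst_dist <- rnorm(100, mean = 420, sd = 150)

# Welch two-sample t-test (does not assume equal variances)
tt <- t.test(best_dist, worst_dist)
tt$p.value  # a large p-value is consistent with "no significant difference"
```

With the real data, `t.test(top_100_restaurants_rating$Distance_nearestparking, worst_100_restaurants_rating$Distance_nearestparking)` would quantify the visual comparison.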
Code
##Creation top 100 restaurants based on number of reviews
sorted_df1.1 <- Bigdata[order(Bigdata$Number_of_reviews, decreasing = TRUE), ]
top_100_restaurants_reviews <- head(sorted_df1.1, 100)
##Creation worst 100 restaurants based on number of reviews
sorted_df1.2 <- Bigdata[order(Bigdata$Number_of_reviews), ]
worst_100_restaurants_reviews <- head(sorted_df1.2, 100)
averages_df1 <- data.frame(
Category = c("Highest", "Lowest"),
AvgDistParking = c(mean(top_100_restaurants_reviews$Distance_nearestparking),
mean(worst_100_restaurants_reviews$Distance_nearestparking)),
AvgDistToStop = c(mean(top_100_restaurants_reviews$Distance_neareststop),
mean(worst_100_restaurants_reviews$Distance_neareststop)),
AvgDistToTrain = c(mean(top_100_restaurants_reviews$Distance_to_trainstation),
mean(worst_100_restaurants_reviews$Distance_to_trainstation)),
AvgDistToJet = c(mean(top_100_restaurants_reviews$Distance_to_jet),
mean(worst_100_restaurants_reviews$Distance_to_jet)),
AvgDistToCatedral = c(mean(top_100_restaurants_reviews$Distance_to_catedral),
mean(worst_100_restaurants_reviews$Distance_to_catedral)),
AvgDistToPatek = c(mean(top_100_restaurants_reviews$Distance_to_patekmuseum),
mean(worst_100_restaurants_reviews$Distance_to_patekmuseum)),
AvgDistToBotanic = c(mean(top_100_restaurants_reviews$Distance_to_botanicgarden),
mean(worst_100_restaurants_reviews$Distance_to_botanicgarden)),
AvgDistToONU = c(mean(top_100_restaurants_reviews$Distance_to_nationpalace),
mean(worst_100_restaurants_reviews$Distance_to_nationpalace)),
AvgDistToBrokenchair = c(mean(top_100_restaurants_reviews$Distance_to_brokenchair),
mean(worst_100_restaurants_reviews$Distance_to_brokenchair))
)
averages_long1 <- tidyr::gather(averages_df1, key = "DistanceType", value = "AverageDistance", -Category)
##barplot with the distances but the restaurants are ranked according to their number of reviews
bar_plot1 <- ggplot(averages_long1, aes(x = DistanceType, y = AverageDistance, fill = Category)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Average Distance to Location - High vs Low Restaurants Reviews",
x = "Distance Type",
y = "Average Distance") +
scale_fill_manual(values = c("Highest" = "green", "Lowest" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(bar_plot1)
We can see that the restaurants with the lowest number of reviews tend to be farther from the botanic garden, the Palais des Nations, and the train station.
3.2 Factor Analysis
First, we built a PCA with all the distance variables. The goal was to compare, after the Principal Component Analysis, which variables contribute in the same way to a given dimension.
Code
##Factor analysis with all distances (or just focusing on one) interesting
data_for_fa <- Bigdata %>%
dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,
Distance_to_jet,Distance_to_catedral,Distance_to_patekmuseum,
Distance_to_botanicgarden, Distance_to_nationpalace, Distance_to_brokenchair)
myPCA <- FactoMineR::PCA(data_for_fa)
Code
#fviz_pca_ind(myPCA,
#geom.ind = "point",
#col.ind = "cos2",
#palette = "jco",
#addEllipses = TRUE,
#ellipse.type = "confidence",
#repel = TRUE)
Code
# Principal component with all the distances variables
vectordistances <- c("Distance_to_trainstation", "Distance_nearestparking","Distance_neareststop",
"Distance_to_jet","Distance_to_catedral","Distance_to_patekmuseum","Distance_to_botanicgarden", "Distance_to_nationpalace", "Distance_to_brokenchair")
distances.pc <- prcomp(Bigdata[,vectordistances])
#summary(distances.pc)
#distances.pc$x
fviz_eig(distances.pc, geom="line")
Code
##Save the component in our df. Based on what we saw, we could rename by the distance that we really have
#Bigdata$distances1 <- distances.pc$x[,1]
#Bigdata$distances2 <- distances.pc$x[,2]
#Bigdata$distances3 <- distances.pc$x[,3]
The Principal Component Analysis (PCA) of the distance-related features reveals insightful patterns. The scree plot shows the variance captured by each principal component (PC): the first three PCs explain most of the variability in the distance data, and the eigenvalues decline sharply after the third component, suggesting diminishing returns in explanatory power beyond that point.
Code
var <- get_pca_var(distances.pc)
a<-fviz_contrib(distances.pc, "var",axes = 1)
b<-fviz_contrib(distances.pc, "var",axes = 2)
c<-fviz_contrib(distances.pc, "var",axes = 3)
grid.arrange(a,b,c,top='Contribution to the Principal Components')
As we can see on the graph above, the variables "Distance_to_botanicgarden", "Distance_to_nationpalace" and "Distance_to_brokenchair" all contribute strongly to PC1, suggesting that an increase in these distances is associated with an increase in PC1. To reduce dimensionality, we chose one variable that represents the overall theme of the group, "Distance_to_nationpalace".
In the same way, we keep only "Distance_to_catedral" for dimension 2 and "Distance_to_jet" for dimension 3.
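This choice of one representative variable per dimension can also be made programmatically by taking, for each PC, the variable with the largest absolute loading. A sketch on toy data, since it only needs a `prcomp` object (on the real data, `distances.pc` would be used directly):

```r
set.seed(1)
# Toy data with 4 columns standing in for the distance variables
toy <- data.frame(
  d_nation   = rnorm(50),
  d_catedral = rnorm(50),
  d_jet      = rnorm(50),
  d_parking  = rnorm(50)
)
pc <- prcomp(toy, scale. = TRUE)

# For each of the first 3 PCs, pick the variable with the largest |loading|
representatives <- apply(abs(pc$rotation[, 1:3]), 2, function(l) names(which.max(l)))
representatives
```

On `distances.pc`, this would return one distance variable per component, mirroring the manual selection above.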
Code
library(plotly)
library(scatterplot3d) # 3D scatter plots (the interactive figure below uses plotly)
# Assuming distances.pc is the result of a PCA
# Perform PCA analysis here if not already done
# pca_result <- prcomp(your_data, scale. = TRUE)
# distances.pc <- pca_result
# Create a dataframe for the scatter plot of PCA scores
pca_scores <- data.frame(distances.pc$x[, 1:3])
names(pca_scores) <- c("PC1", "PC2", "PC3")
# Create a dataframe for the arrows
arrows <- data.frame(
x = rep(0, nrow(distances.pc$rotation)),
y = rep(0, nrow(distances.pc$rotation)),
z = rep(0, nrow(distances.pc$rotation)),
u = distances.pc$rotation[, 1],
v = distances.pc$rotation[, 2],
w = distances.pc$rotation[, 3]
)
# First plot the PCA scores
p <- plot_ly(data = pca_scores, x = ~PC1, y = ~PC2, z = ~PC3, type = 'scatter3d', mode = 'markers',
marker = list(size = 2, color = 'blue')) %>%
add_markers()
# Then add the arrows for each principal component loading
for(i in 1:nrow(arrows)) {
p <- p %>% add_trace(
type = "cone",
x = c(0, arrows$x[i]),
y = c(0, arrows$y[i]),
z = c(0, arrows$z[i]),
u = c(0, arrows$u[i]),
v = c(0, arrows$v[i]),
w = c(0, arrows$w[i]),
anchor = "tail",
showscale = FALSE,
sizemode = "absolute",
sizeref = 0.1,
opacity = 0.6
)
}
# Finalize the layout
p <- p %>% layout(
scene = list(
xaxis = list(title = 'Distance ONU'),
yaxis = list(title = 'Distance Catedral'),
zaxis = list(title = 'Distance Jet'),
aspectmode = 'cube'
),
title = "3D PCA Visualization"
)
# Show the plot
p
In the PCA plot of the variables, we see that the variables Distance_to_botanicgarden, Distance_to_brokenchair and Distance_to_nationpalace are correlated, as well as the variables Distance_neareststop, Distance_to_catedral and Distance_nearestparking.
In the graph representing the 3 clusters, dimension 1 is an average between the correlated variables Distance_to_botanicgarden, Distance_to_brokenchair and Distance_to_nationpalace; and dimension 2 is an average between the correlated variables Distance_neareststop, Distance_to_catedral and Distance_nearestparking.
Code
##Factor analysis with all distances (or just focusing on one) interesting
data_for_fa <- Bigdata %>%
dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,Distance_to_jet,Distance_to_catedral, Distance_to_nationpalace)
myPCA <- FactoMineR::PCA(data_for_fa)
Code
#fviz_pca_ind(myPCA,
#geom.ind = "point",
#col.ind = "cos2",
#palette = "jco",
#addEllipses = TRUE,
#ellipse.type = "confidence",
#repel = TRUE)
The PCA graph visualizes the relative importance and contribution of the distance variables to the first two principal components. Dimension 1 (x-axis) explains 40.75% of the variance and is strongly influenced by 'Distance_to_jet', 'Distance_nearestparking', and 'Distance_neareststop', suggesting these variables are correlated and may capture a similar aspect of the data. Dimension 2 (y-axis) accounts for 29.35% of the variance and is most influenced by 'Distance_to_nationpalace' and 'Distance_to_trainstation', indicating these contribute differently to the dataset's variance. The contrast with the first PCA above is clearly visible.
3.3 Formation of clusters
Code
##Cluster according to all the distances that we have
data_for_cluster <- Bigdata %>%
dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,Distance_to_jet,Distance_to_catedral, Distance_to_nationpalace)
scaled_features <- scale(data_for_cluster)
result <- NbClust(scaled_features, distance = "euclidean", method = "kmeans", min.nc = 2, max.nc = 10, index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 6 proposed 2 as the best number of clusters
* 8 proposed 3 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 6 proposed 6 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
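The majority-rule decision can be sanity-checked with a simple elbow curve of the total within-cluster sum of squares. A sketch on simulated 2-D data with three underlying groups (in practice `scaled_features` from the chunk above would be substituted):

```r
set.seed(123)
# Simulated scaled features with 3 well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# The "elbow" (sharp flattening of the curve) should appear around k = 3
```

The elbow in this curve complements the NbClust indices: both point to the k beyond which adding clusters yields little improvement.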
Code
## 3 clusters
kmeans_result <- kmeans(scaled_features, centers = 3, nstart = 25)
fviz_cluster_object <- fviz_cluster(kmeans_result, data = scaled_features,
repel = TRUE, # To avoid text overlapping
show.clust.cent = TRUE,
palette = c("#AC0C0C","#FFBF00", "#007000"),
geom = "point", # Removed "text" to avoid clutter
ellipse.type = "convex",
ggtheme = theme_bw()) +
geom_point(size = 2, alpha = 0) # Adjust size and transparency
# Now you can adjust the scale manually by changing the limits if needed
fviz_cluster_object <- fviz_cluster_object +
xlim(c(-13, 5)) +
ylim(c(-6, 5))
# Print the plot
print(fviz_cluster_object)
Here are the k-means centers for each cluster, given the variables chosen for clustering:
Code
result_table <- kmeans_result$centers
kable(result_table, "html") %>%
kable_styling(full_width = FALSE)
| Distance_to_trainstation | Distance_nearestparking | Distance_neareststop | Distance_to_jet | Distance_to_catedral | Distance_to_nationpalace |
|---|---|---|---|---|---|
| -1.0055810 | -0.2607063 | 0.015542 | -0.2918398 | 0.2663589 | -0.9165445 |
| 0.5213387 | -0.0075770 | -0.178913 | -0.0211579 | -0.4212910 | 0.6570413 |
| 1.5411635 | 2.0818411 | 1.989832 | 2.4798509 | 2.9294726 | -0.7386820 |
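These centers are in standard-deviation units because the features were scaled before clustering. To read them in metres, they can be transformed back using the attributes stored by `scale()`. A sketch with toy equivalents of `data_for_cluster` and `kmeans_result`:

```r
set.seed(7)
# Toy distances in metres, standing in for data_for_cluster
toy <- data.frame(dist_a = runif(60, 100, 2000),
                  dist_b = runif(60, 50, 900))
scaled <- scale(toy)
km <- kmeans(scaled, centers = 3, nstart = 25)

# Undo the scaling column by column: centre * sd + mean
centers_m <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), `*`)
centers_m <- sweep(centers_m, 2, attr(scaled, "scaled:center"), `+`)
round(centers_m)  # cluster centers back in the original units (metres)
```

Applied to `scaled_features` and `kmeans_result`, this would express each cluster center as an actual distance, which is easier to interpret than a z-score.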
3.4 Regression Tree
3.4.1 Regression Tree with Number of reviews
Code
##Based on our dataset Bigdata that contains all the variables that we want to include in our model
###Tree made with the variable number of reviews
Bigdata1 <- Bigdata %>% dplyr::select(-c(photoCount, rating, Distance_to_patekmuseum, Distance_to_botanicgarden, rawRanking))
set.seed(123)
indices <- Bigdata1$Number_of_reviews %>%
as.character() %>%
createDataPartition(
p = 0.8,
list = FALSE)
train = Bigdata1[indices,]
validation = Bigdata1[-indices,]
Dtree1 = rpart(Number_of_reviews ~.,
data = train,
control=list(cp=.01, xval=10),
parms = list(split = "gini"))
#summary(Dtree1)
rpart.plot(Dtree1)
3.4.2 Regression Tree with rating
Code
##we split our model excluding the variable number of reviews. Tree made with rating
Bigdata2 <- Bigdata %>% dplyr::select(-c(Number_of_reviews, photoCount, Distance_to_patekmuseum, Distance_to_botanicgarden, rawRanking))
set.seed(123)
indices <- Bigdata2$rating %>%
as.character() %>%
createDataPartition(
p = 0.8,
list = FALSE)
train = Bigdata2[indices,]
validation = Bigdata2[-indices,]
#### Gini
#Dtree1 = rpart(rating ~.,
#data = train,
#control=list(cp=.01, xval=10),
#parms = list(split = "gini")) # default split is Gini, write "information" otherwise.
#summary(Dtree1)
# Plot tree
#plot(Dtree1, margin = 0.05)
#text(Dtree1, use.n = TRUE, cex = 0.7)
##### Information gain
Dtree2 = rpart(rating ~.,
data = train,
control=list(cp=.01, xval=10),
parms = list(split = "information")) # information gain based on entropy
# Plot trees
#par(mfrow = c(1,2), mar = rep(0.3, 4))
#plot(Dtree1, margin = 0.05); text(Dtree1, use.n = TRUE, cex = 0.3)
#plot(Dtree2, margin = 0.05); text(Dtree2, use.n = TRUE, cex = 0.3)
rpart.plot(Dtree2, extra = 101, under = TRUE, cex = 0.8)
3.5 Multiple Regression
3.5.1 Number of Reviews
Code
corr_matrixreviews <-
Bigdata %>% cor(use = "complete.obs") %>% round(digits = 4)
corr_matrixreviews <- corr_matrixreviews %>% knitr::kable(caption = "Correlation matrix",
align = 'c',
digits = 3) %>%
kableExtra:: kable_styling(c("striped", "bordered"),
full_width = FALSE,
position = "center")
corr_matrixreviews
We decided to estimate the following regression, which considers all the variables presented above, to determine the number of reviews:
Number of Reviews = \beta_0 + \beta_1*rating + \beta_2*photoCount + \beta_3*rankingPosition + \beta_4*OpenedHours \\ + \beta_5*ScoreCompetition + \beta_6*AveragedPrice + \beta_7*French + \beta_8*Italian + \beta_9*European \\ + \beta_{10}*Vegetarian + \beta_{11}*Vegan + \beta_{12}*Mediterranean + \beta_{13}*Asian + \beta_{14}*GlutenFree \\ + \beta_{15}*Spanish + \beta_{16}*Swiss + \beta_{17}*Lunch + \beta_{18}*Dinner + \beta_{19}*Drinks + \beta_{20}*Brunch \\ + \beta_{21}*Breakfast + \beta_{22}*LateNightDrinks + \beta_{23}*\log(DistanceTrainStation) \\ + \beta_{24}*\log(DistanceNearestParking) + \beta_{25}*\log(DistanceNearestStop) + \beta_{26}*\log(DistanceJet) \\ + \beta_{27}*\log(DistanceCathedral) + \beta_{28}*\log(DistanceNationPalace)
Code
completemodel <- lm(Number_of_reviews ~ rating + photoCount + rankingPosition + OpenedHours + averaged_score_competition + averaged_price + French + Italian + European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), data = Bigdata)
Code
summary(completemodel)
Call:
lm(formula = Number_of_reviews ~ rating + photoCount + rankingPosition +
OpenedHours + averaged_score_competition + averaged_price +
French + Italian + European + Vegetarian + Vegan + Mediterranean +
Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +
Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) +
log(Distance_nearestparking) + log(Distance_neareststop) +
log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace),
data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-913.56 -48.87 -3.01 35.83 1369.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.030e+03 2.217e+02 4.644 4.30e-06 ***
rating -1.114e+02 1.695e+01 -6.571 1.17e-10 ***
photoCount 1.650e+00 7.267e-02 22.700 < 2e-16 ***
rankingPosition -1.405e-01 3.761e-02 -3.735 0.000207 ***
OpenedHours 4.361e+00 1.377e+00 3.167 0.001628 **
averaged_score_competition -7.976e+00 3.507e+01 -0.227 0.820186
averaged_price 1.919e-03 4.067e-03 0.472 0.637162
French -8.363e+00 1.666e+01 -0.502 0.615828
Italian 5.147e+00 1.808e+01 0.285 0.776008
European 8.624e+00 1.594e+01 0.541 0.588646
Vegetarian -1.881e+01 1.485e+01 -1.267 0.205821
Vegan -5.399e+00 1.691e+01 -0.319 0.749641
Mediterranean -2.205e+01 1.821e+01 -1.211 0.226373
Asian 3.675e+00 1.919e+01 0.192 0.848184
Gluten_free -5.093e+00 2.069e+01 -0.246 0.805692
Spanish 7.789e+00 3.848e+01 0.202 0.839688
Swiss 2.333e+00 2.056e+01 0.113 0.909718
Lunch 1.967e+01 2.392e+01 0.822 0.411157
Dinner -2.195e+01 2.460e+01 -0.892 0.372638
Drinks -5.093e+01 1.428e+01 -3.566 0.000395 ***
Brunch 3.162e+01 2.299e+01 1.376 0.169501
Breakfast -3.907e+01 1.919e+01 -2.036 0.042248 *
Late_Night_Drinks 5.315e+01 1.990e+01 2.671 0.007800 **
log(Distance_to_trainstation) 3.698e+00 1.172e+01 0.316 0.752476
log(Distance_nearestparking) -1.102e+01 9.114e+00 -1.209 0.227226
log(Distance_neareststop) -9.324e+00 9.113e+00 -1.023 0.306684
log(Distance_to_jet) 2.258e+01 1.531e+01 1.475 0.140761
log(Distance_to_catedral) -4.059e+01 1.282e+01 -3.167 0.001629 **
log(Distance_to_nationpalace) -3.218e+01 2.142e+01 -1.502 0.133716
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 142.7 on 543 degrees of freedom
Multiple R-squared: 0.6912, Adjusted R-squared: 0.6753
F-statistic: 43.41 on 28 and 543 DF, p-value: < 2.2e-16
Code
#tab_model(completemodel)
The model has an overall good fit, with a Multiple R-squared of 0.6912, indicating that approximately 69.12% of the variance in the number of reviews is explained by the included variables. The p-value (< 2.2e-16) of the F-statistic suggests that the model is statistically significant.
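As a sanity check, these fit statistics can be recovered by hand from the model object. The following is a minimal sketch, not part of the original analysis, assuming `completemodel` has been estimated as above:

```r
# Recompute the fit statistics reported by summary() from first principles
y   <- model.frame(completemodel)[[1]]           # response used in the fit
rss <- sum(residuals(completemodel)^2)           # residual sum of squares
tss <- sum((y - mean(y))^2)                      # total sum of squares
r2  <- 1 - rss / tss                             # Multiple R-squared
n   <- length(y)
p   <- length(coef(completemodel)) - 1           # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # Adjusted R-squared
```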
We then estimated another regression, removing the rating and photoCount variables:
Code
#completemodel1 <- lm(Number_of_reviews ~ ., dplyr::select(Bigdata, -rating, -photoCount))
completemodel1 <- lm(Number_of_reviews ~rankingPosition + OpenedHours + averaged_score_competition+ averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata)
summary(completemodel1)
Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours +
averaged_score_competition + averaged_price + French + Italian +
European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free +
Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast +
Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) +
log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) +
log(Distance_to_nationpalace), data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-360.15 -96.54 -18.05 52.84 1814.73
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.366e+03 3.033e+02 4.505 8.14e-06 ***
rankingPosition -3.842e-01 4.806e-02 -7.994 7.85e-15 ***
OpenedHours 7.100e+00 1.986e+00 3.575 0.000381 ***
averaged_score_competition -1.262e+02 5.025e+01 -2.512 0.012291 *
averaged_price 1.515e-03 5.866e-03 0.258 0.796234
French 1.111e+01 2.409e+01 0.461 0.644769
Italian 8.644e+00 2.614e+01 0.331 0.741002
European 3.497e+01 2.282e+01 1.532 0.125997
Vegetarian 9.438e+00 2.105e+01 0.448 0.654040
Vegan -5.271e+00 2.450e+01 -0.215 0.829749
Mediterranean -2.787e+01 2.630e+01 -1.059 0.289842
Asian -2.188e+01 2.764e+01 -0.791 0.429078
Gluten_free 1.249e+02 2.891e+01 4.321 1.85e-05 ***
Spanish -3.772e+01 5.557e+01 -0.679 0.497614
Swiss 1.538e+01 2.977e+01 0.516 0.605740
Lunch 1.478e+01 3.449e+01 0.428 0.668510
Dinner -9.685e+00 3.563e+01 -0.272 0.785880
Drinks -6.303e+01 2.065e+01 -3.052 0.002383 **
Brunch 7.303e+01 3.311e+01 2.205 0.027848 *
Breakfast -1.108e+02 2.746e+01 -4.035 6.25e-05 ***
Late_Night_Drinks 6.781e+01 2.879e+01 2.355 0.018876 *
log(Distance_to_trainstation) 1.105e+01 1.698e+01 0.651 0.515380
log(Distance_nearestparking) -1.149e+01 1.320e+01 -0.870 0.384651
log(Distance_neareststop) -1.705e+01 1.320e+01 -1.292 0.197000
log(Distance_to_jet) 2.282e+01 2.217e+01 1.029 0.303761
log(Distance_to_catedral) -5.663e+01 1.852e+01 -3.057 0.002342 **
log(Distance_to_nationpalace) -4.142e+01 3.104e+01 -1.335 0.182581
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 206.8 on 545 degrees of freedom
Multiple R-squared: 0.3491, Adjusted R-squared: 0.318
F-statistic: 11.24 on 26 and 545 DF, p-value: < 2.2e-16
This model fits less well, with a Multiple R-squared of 0.3491 (Adjusted R-squared of 0.318), indicating that approximately 34.9% of the variance in the number of reviews is explained by the included variables. The p-value (< 2.2e-16) of the F-statistic nevertheless indicates that the model is statistically significant.
We then applied backward elimination:
Code
###Backward elimination
null_model <- lm(Number_of_reviews ~ 1, data = Bigdata)
final_model <- step(completemodel1, scope = list(lower = null_model, upper = completemodel), direction = "backward")
Code
summary(final_model)
Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours +
averaged_score_competition + European + Gluten_free + Drinks +
Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_catedral),
data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-351.78 -94.79 -22.78 52.42 1861.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1075.3942 175.0917 6.142 1.55e-09 ***
rankingPosition -0.3892 0.0451 -8.629 < 2e-16 ***
OpenedHours 7.3458 1.8758 3.916 0.000101 ***
averaged_score_competition -125.5517 40.9100 -3.069 0.002252 **
European 50.9877 17.7336 2.875 0.004191 **
Gluten_free 121.7124 26.3883 4.612 4.94e-06 ***
Drinks -64.0061 20.2081 -3.167 0.001622 **
Brunch 70.1870 32.3550 2.169 0.030481 *
Breakfast -102.1132 26.2873 -3.885 0.000115 ***
Late_Night_Drinks 63.3349 27.8000 2.278 0.023088 *
log(Distance_to_catedral) -48.8540 13.5522 -3.605 0.000340 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 205.4 on 561 degrees of freedom
Multiple R-squared: 0.3393, Adjusted R-squared: 0.3275
F-statistic: 28.81 on 10 and 561 DF, p-value: < 2.2e-16
Backward selection based on AIC drops several variables. Our final regression is then:
Number of Reviews =\beta_0 + \beta_1*rankingPosition + \beta_2*OpenedHours \\+ \beta_{3}*ScoreCompetition + \beta_{4}*European + \beta_5*log(Distancecatedral) +\beta_{6}*Glutenfree+\beta_{7}*Breakfast + \beta_{8}*Brunch + \beta_{9}*Drinks + \\ \beta_{10}*LateNightDrinks
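Once the final model is fitted, it can be used for prediction. The following sketch uses purely illustrative values (not taken from the dataset) for a hypothetical restaurant; `predict()` applies the `log()` transform from the formula automatically:

```r
# Hypothetical restaurant: every value below is made up for illustration
new_restaurant <- data.frame(
  rankingPosition = 100, OpenedHours = 60,
  averaged_score_competition = 4, European = 1, Gluten_free = 0,
  Drinks = 0, Brunch = 1, Breakfast = 0, Late_Night_Drinks = 0,
  Distance_to_catedral = 500)
# Predicted number of reviews under the final model
predict(final_model, newdata = new_restaurant)
```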
We check if there is a multi-collinearity issue:
Code
olsrr::ols_vif_tol(final_model) %>% kableExtra::kable(digits = 3) %>% kableExtra::kable_styling(c("striped", "bordered")) %>% kableExtra::scroll_box(width = "100%", height = "300px")
| Variables | Tolerance | VIF |
|---|---|---|
| rankingPosition | 0.799 | 1.251 |
| OpenedHours | 0.861 | 1.161 |
| averaged_score_competition | 0.913 | 1.095 |
| European | 0.949 | 1.054 |
| Gluten_free | 0.845 | 1.183 |
| Drinks | 0.734 | 1.363 |
| Brunch | 0.883 | 1.132 |
| Breakfast | 0.791 | 1.265 |
| Late_Night_Drinks | 0.754 | 1.326 |
| log(Distance_to_catedral) | 0.918 | 1.090 |
No variable shows a severe multicollinearity issue, since all VIF values are below 5.
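For reference, the VIF of predictor j is derived from the R-squared obtained by regressing that predictor on all the others:

VIF_j = \frac{1}{1 - R_j^2}

so a VIF below 5 corresponds to R_j^2 < 0.8, i.e. no predictor is more than 80% explained by the remaining predictors.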
Code
forecast::accuracy(final_model) %>% tibble::as_tibble() %>% dplyr::select(RMSE, MAE, MASE) %>%
kableExtra::kable(caption = "Accuracy of the Linear Model", align = 'c') %>%
kableExtra::kable_styling(c("striped", "bordered"),
full_width = FALSE,
position = "center")
| RMSE | MAE | MASE |
|---|---|---|
| 203.3935 | 110.6714 | 0.8318808 |
Code
lindia::gg_qqplot(final_model)
An RMSE of 203 indicates that, on average, the model’s predictions are off by approximately 203 reviews from the actual values, while the MAE of 110 shows an average absolute deviation of about 110 reviews, giving less weight to large errors.
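These in-sample accuracy measures can also be computed directly from the residuals; a minimal sketch, assuming `final_model` is in the workspace:

```r
# RMSE and MAE from the residuals of the fitted model (in-sample errors)
res  <- residuals(final_model)
rmse <- sqrt(mean(res^2))   # root mean squared error
mae  <- mean(abs(res))      # mean absolute error
```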
3.5.2 Rating
We decided to estimate the following regression, which considers all the variables presented above, to determine the rating:
Rating = \beta_0 + \beta_1*NumberOfReviews + \beta_2*photoCount + \beta_3*rankingPosition + \beta_4*OpenedHours \\ + \beta_5*ScoreCompetition + \beta_6*AveragedPrice + \beta_7*French + \beta_8*Italian + \beta_9*European \\ + \beta_{10}*Vegetarian + \beta_{11}*Vegan + \beta_{12}*Mediterranean + \beta_{13}*Asian + \beta_{14}*GlutenFree \\ + \beta_{15}*Spanish + \beta_{16}*Swiss + \beta_{17}*Lunch + \beta_{18}*Dinner + \beta_{19}*Drinks + \beta_{20}*Brunch \\ + \beta_{21}*Breakfast + \beta_{22}*LateNightDrinks + \beta_{23}*\log(DistanceTrainStation) \\ + \beta_{24}*\log(DistanceNearestParking) + \beta_{25}*\log(DistanceNearestStop) + \beta_{26}*\log(DistanceJet) \\ + \beta_{27}*\log(DistanceCathedral) + \beta_{28}*\log(DistanceNationPalace)
Code
completemodelrating <- lm(rating ~ Number_of_reviews + photoCount + rankingPosition + OpenedHours + averaged_score_competition + averaged_price + French + Italian + European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), data = Bigdata)
Code
summary(completemodelrating)
Call:
lm(formula = rating ~ Number_of_reviews + photoCount + rankingPosition +
OpenedHours + averaged_score_competition + averaged_price +
French + Italian + European + Vegetarian + Vegan + Mediterranean +
Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +
Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) +
log(Distance_nearestparking) + log(Distance_neareststop) +
log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace),
data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-0.85644 -0.23262 -0.03209 0.23060 1.04832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.270e+00 5.195e-01 8.219 1.52e-15 ***
Number_of_reviews -6.614e-04 1.006e-04 -6.571 1.17e-10 ***
photoCount 6.267e-04 2.457e-04 2.550 0.01103 *
rankingPosition -7.213e-04 8.750e-05 -8.244 1.26e-15 ***
OpenedHours -2.899e-03 3.384e-03 -0.857 0.39194
averaged_score_competition 1.781e-01 8.512e-02 2.093 0.03686 *
averaged_price 2.215e-05 9.865e-06 2.245 0.02514 *
French 2.194e-02 4.059e-02 0.540 0.58912
Italian 6.713e-02 4.396e-02 1.527 0.12738
European -1.255e-01 3.847e-02 -3.262 0.00118 **
Vegetarian -1.811e-01 3.540e-02 -5.117 4.32e-07 ***
Vegan -1.935e-02 4.120e-02 -0.470 0.63882
Mediterranean -8.771e-02 4.427e-02 -1.981 0.04807 *
Asian -7.776e-02 4.664e-02 -1.667 0.09609 .
Gluten_free -4.510e-03 5.043e-02 -0.089 0.92877
Spanish 1.721e-01 9.348e-02 1.841 0.06610 .
Swiss 2.990e-02 5.009e-02 0.597 0.55081
Lunch -1.085e-01 5.813e-02 -1.866 0.06255 .
Dinner -5.289e-02 5.995e-02 -0.882 0.37803
Drinks 1.799e-02 3.520e-02 0.511 0.60954
Brunch 9.421e-02 5.597e-02 1.683 0.09291 .
Breakfast 5.990e-02 4.687e-02 1.278 0.20184
Late_Night_Drinks -2.669e-02 4.880e-02 -0.547 0.58468
log(Distance_to_trainstation) 7.645e-03 2.856e-02 0.268 0.78903
log(Distance_nearestparking) 5.243e-03 2.224e-02 0.236 0.81369
log(Distance_neareststop) -9.053e-03 2.222e-02 -0.407 0.68392
log(Distance_to_jet) -1.133e-02 3.737e-02 -0.303 0.76184
log(Distance_to_catedral) 2.186e-02 3.151e-02 0.694 0.48814
log(Distance_to_nationpalace) -3.169e-02 5.230e-02 -0.606 0.54483
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3478 on 543 degrees of freedom
Multiple R-squared: 0.2596, Adjusted R-squared: 0.2215
F-statistic: 6.801 on 28 and 543 DF, p-value: < 2.2e-16
The regression model was fitted to predict the rating of a restaurant based on various features. The model shows that the number of reviews, ranking position, and several other factors significantly influence the restaurant’s rating. The overall model has an adjusted R-squared value of 0.2215, indicating that the included variables explain about 22.15% of the variability in the restaurant ratings.
We then estimated another regression, removing the Number of Reviews and photoCount variables:
Code
completemodelrating1 <- lm(rating ~ rankingPosition + OpenedHours + averaged_score_competition + averaged_price + French + Italian + European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), data = Bigdata)
Code
summary(completemodelrating1)
Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition +
averaged_price + French + Italian + European + Vegetarian +
Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss +
Lunch + Dinner + Drinks + Brunch + Breakfast + Late_Night_Drinks +
log(Distance_to_trainstation) + log(Distance_nearestparking) +
log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) +
log(Distance_to_nationpalace), data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-1.00853 -0.23639 -0.03135 0.25110 1.05061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.649e+00 5.325e-01 6.852 1.97e-11 ***
rankingPosition -5.846e-04 8.438e-05 -6.928 1.21e-11 ***
OpenedHours -6.844e-03 3.487e-03 -1.963 0.050167 .
averaged_score_competition 2.263e-01 8.823e-02 2.564 0.010603 *
averaged_price 2.192e-05 1.030e-05 2.129 0.033726 *
French 2.296e-02 4.230e-02 0.543 0.587506
Italian 6.551e-02 4.589e-02 1.427 0.154021
European -1.447e-01 4.007e-02 -3.611 0.000333 ***
Vegetarian -1.844e-01 3.696e-02 -4.991 8.09e-07 ***
Vegan -1.651e-02 4.302e-02 -0.384 0.701264
Mediterranean -7.465e-02 4.618e-02 -1.616 0.106592
Asian -7.622e-02 4.854e-02 -1.570 0.116934
Gluten_free -3.940e-02 5.076e-02 -0.776 0.438000
Spanish 1.877e-01 9.757e-02 1.924 0.054857 .
Swiss 2.578e-02 5.228e-02 0.493 0.622145
Lunch -1.254e-01 6.056e-02 -2.071 0.038823 *
Dinner -4.367e-02 6.257e-02 -0.698 0.485488
Drinks 5.751e-02 3.626e-02 1.586 0.113309
Brunch 6.436e-02 5.814e-02 1.107 0.268808
Breakfast 1.106e-01 4.822e-02 2.294 0.022182 *
Late_Night_Drinks -6.888e-02 5.056e-02 -1.362 0.173614
log(Distance_to_trainstation) 3.269e-03 2.981e-02 0.110 0.912722
log(Distance_nearestparking) 1.322e-02 2.318e-02 0.570 0.568708
log(Distance_neareststop) -7.442e-04 2.317e-02 -0.032 0.974390
log(Distance_to_jet) -2.750e-02 3.893e-02 -0.706 0.480286
log(Distance_to_catedral) 5.557e-02 3.252e-02 1.709 0.088067 .
log(Distance_to_nationpalace) -8.148e-03 5.450e-02 -0.150 0.881207
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3631 on 545 degrees of freedom
Multiple R-squared: 0.1897, Adjusted R-squared: 0.151
F-statistic: 4.907 on 26 and 545 DF, p-value: 1.579e-13
Then by following the same structure, we applied a backward selection based on AIC:
Code
###Backward elimination
null_model <- lm(rating ~ 1, data = Bigdata)
final_modelrating <- step(completemodelrating1, scope = list(lower = null_model, upper = completemodelrating), direction = "backward")
Code
summary(final_modelrating)
Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition +
averaged_price + European + Vegetarian + Mediterranean +
Asian + Spanish + Lunch + Drinks + Breakfast + Late_Night_Drinks +
log(Distance_to_catedral), data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-0.9674 -0.2421 -0.0352 0.2436 1.0119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.565e+00 3.085e-01 11.557 < 2e-16 ***
rankingPosition -5.450e-04 7.727e-05 -7.053 5.22e-12 ***
OpenedHours -7.259e-03 3.410e-03 -2.129 0.033710 *
averaged_score_competition 2.160e-01 7.206e-02 2.998 0.002841 **
averaged_price 2.242e-05 1.019e-05 2.200 0.028252 *
European -1.229e-01 3.398e-02 -3.617 0.000325 ***
Vegetarian -1.865e-01 3.396e-02 -5.494 6.00e-08 ***
Mediterranean -6.913e-02 4.168e-02 -1.658 0.097796 .
Asian -1.021e-01 4.613e-02 -2.214 0.027200 *
Spanish 1.760e-01 9.578e-02 1.837 0.066672 .
Lunch -1.253e-01 5.797e-02 -2.162 0.031070 *
Drinks 6.282e-02 3.542e-02 1.774 0.076643 .
Breakfast 1.296e-01 4.484e-02 2.891 0.003989 **
Late_Night_Drinks -7.017e-02 4.939e-02 -1.421 0.155899
log(Distance_to_catedral) 4.342e-02 2.398e-02 1.810 0.070779 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3609 on 557 degrees of freedom
Multiple R-squared: 0.1819, Adjusted R-squared: 0.1614
F-statistic: 8.847 on 14 and 557 DF, p-value: < 2.2e-16
Backward selection based on AIC drops several variables. Our final regression is then:
Rating =\beta_0 + \beta_1*rankingPosition + \beta_2*OpenedHours + \beta_3*log(Distancecatedral)+\\ \beta_{4}*Scorecompetition + \beta_{5}*Averagedprice + \beta_{6}*European + \beta_{7}*Mediterranean + \\ \beta_{8}*Vegetarian + \beta_{9}*Asian + \beta_{10}*Spanish + \beta_{11}*Lunch + \beta_{12}*Breakfast + \\ \beta_{13}*Drinks + \beta_{14}*LateNightDrinks
We check if there is a multi-collinearity issue:
Code
olsrr::ols_vif_tol(final_modelrating) %>% kableExtra::kable(digits = 3) %>% kableExtra::kable_styling(c("striped", "bordered")) %>% kableExtra::scroll_box(width = "100%", height = "300px")
| Variables | Tolerance | VIF |
|---|---|---|
| rankingPosition | 0.841 | 1.189 |
| OpenedHours | 0.805 | 1.242 |
| averaged_score_competition | 0.909 | 1.100 |
| averaged_price | 0.984 | 1.017 |
| European | 0.798 | 1.253 |
| Vegetarian | 0.819 | 1.221 |
| Mediterranean | 0.895 | 1.118 |
| Asian | 0.714 | 1.400 |
| Spanish | 0.972 | 1.029 |
| Lunch | 0.850 | 1.177 |
| Drinks | 0.738 | 1.355 |
| Breakfast | 0.839 | 1.192 |
| Late_Night_Drinks | 0.738 | 1.355 |
| log(Distance_to_catedral) | 0.905 | 1.105 |
No variable shows a severe multicollinearity issue, since all VIF values are below 5.
Code
forecast::accuracy(final_modelrating) %>% tibble::as_tibble() %>% dplyr::select(RMSE, MAE, MASE) %>%
kableExtra::kable(caption = "Accuracy of the Linear Model", align = 'c') %>%
kableExtra::kable_styling(c("striped", "bordered"),
full_width = FALSE,
position = "center")
| RMSE | MAE | MASE |
|---|---|---|
| 0.3561674 | 0.288294 | 0.8448291 |
Code
lindia::gg_qqplot(final_modelrating)
An RMSE of 0.3561674 indicates that, on average, the model’s predictions are off by approximately 0.356 units from the actual values. An MAE of 0.288294 means that, on average, the model’s predictions deviate by approximately 0.288 units from the actual values.
3.5.3 Predictive modeling with Number of Reviews
Cross-validation is a robust technique for assessing the performance of a model by partitioning the dataset into K subsets. This method provides a more comprehensive evaluation, reduces the risk of overfitting or underfitting, and offers a more reliable estimate of the model’s generalization performance on unseen data.
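The idea can be written out by hand; the following is a sketch equivalent in spirit to `caret::trainControl(method = "cv", number = 10)`, assuming `Bigdata` and the formula of `final_model` are available:

```r
# Manual 10-fold cross-validation: each observation is assigned to one
# fold, the model is refitted k times leaving one fold out, and the
# held-out fold is used to measure prediction error
set.seed(123)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(Bigdata)))
cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(formula(final_model), data = Bigdata[folds != i, ])
  pred <- predict(fit, newdata = Bigdata[folds == i, ])
  sqrt(mean((Bigdata$Number_of_reviews[folds == i] - pred)^2))
})
mean(cv_rmse)   # average out-of-fold RMSE
```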
LOOCV
Code
#LOOCV
train.control <- trainControl(method = "LOOCV")
# Train the model
model_lo <- Bigdata %>%
train(Number_of_reviews ~ ., data = ., method = "lm", trControl = train.control)
# Summarize the results
#print(model_lo)
summary(model_lo)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-911.99 -47.45 -4.03 34.96 1416.47
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.304e+02 2.557e+02 2.074 0.038512 *
rating -1.144e+02 1.794e+01 -6.379 3.85e-10 ***
photoCount 1.640e+00 8.331e-02 19.683 < 2e-16 ***
rankingPosition -1.341e-01 5.699e-02 -2.354 0.018929 *
rawRanking 1.729e+01 4.338e+01 0.399 0.690415
Distance_to_trainstation 7.842e-03 4.760e-02 0.165 0.869195
Distance_nearestparking -4.097e-02 3.423e-02 -1.197 0.231864
Distance_neareststop 2.768e-02 4.634e-02 0.597 0.550499
Distance_to_jet 7.628e-02 4.871e-02 1.566 0.117928
Distance_to_catedral -1.429e-01 4.389e-02 -3.255 0.001206 **
Distance_to_patekmuseum 7.391e-02 3.320e-02 2.227 0.026393 *
Distance_to_botanicgarden 5.584e-02 9.460e-02 0.590 0.555254
Distance_to_nationpalace -8.497e-02 2.153e-01 -0.395 0.693314
Distance_to_brokenchair 1.205e-02 1.896e-01 0.064 0.949354
averaged_score_competition -8.210e+00 4.014e+01 -0.205 0.838015
French -1.024e+01 1.681e+01 -0.609 0.542960
Italian 4.867e+00 1.824e+01 0.267 0.789688
European 6.199e+00 1.610e+01 0.385 0.700314
Vegetarian -1.647e+01 1.494e+01 -1.102 0.270779
Vegan -7.922e+00 1.698e+01 -0.466 0.641099
Mediterranean -2.044e+01 1.825e+01 -1.120 0.263282
Asian 4.956e+00 1.929e+01 0.257 0.797287
Gluten_free -8.758e+00 2.101e+01 -0.417 0.677007
Spanish 5.991e+00 3.866e+01 0.155 0.876913
Swiss 4.395e+00 2.076e+01 0.212 0.832388
Lunch 1.545e+01 2.408e+01 0.642 0.521343
Drinks -5.037e+01 1.435e+01 -3.509 0.000487 ***
Brunch 3.278e+01 2.324e+01 1.411 0.158934
Breakfast -3.708e+01 1.926e+01 -1.925 0.054797 .
Dinner -2.317e+01 2.476e+01 -0.936 0.349744
Late_Night_Drinks 5.161e+01 2.001e+01 2.579 0.010164 *
averaged_price 1.809e-03 4.086e-03 0.443 0.658077
OpenedHours 4.543e+00 1.379e+00 3.294 0.001053 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 143.3 on 539 degrees of freedom
Multiple R-squared: 0.6911, Adjusted R-squared: 0.6728
F-statistic: 37.69 on 32 and 539 DF, p-value: < 2.2e-16
K-Fold Cross Validation
Code
set.seed(123)
train.control1 <- trainControl(method = "cv", number = 10)
# Train the model
model_k <- Bigdata %>%
train(Number_of_reviews ~ ., data = ., method = "lm", trControl = train.control1)
# Summarize the results
#print(model_k)
summary(model_k)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-911.99 -47.45 -4.03 34.96 1416.47
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.304e+02 2.557e+02 2.074 0.038512 *
rating -1.144e+02 1.794e+01 -6.379 3.85e-10 ***
photoCount 1.640e+00 8.331e-02 19.683 < 2e-16 ***
rankingPosition -1.341e-01 5.699e-02 -2.354 0.018929 *
rawRanking 1.729e+01 4.338e+01 0.399 0.690415
Distance_to_trainstation 7.842e-03 4.760e-02 0.165 0.869195
Distance_nearestparking -4.097e-02 3.423e-02 -1.197 0.231864
Distance_neareststop 2.768e-02 4.634e-02 0.597 0.550499
Distance_to_jet 7.628e-02 4.871e-02 1.566 0.117928
Distance_to_catedral -1.429e-01 4.389e-02 -3.255 0.001206 **
Distance_to_patekmuseum 7.391e-02 3.320e-02 2.227 0.026393 *
Distance_to_botanicgarden 5.584e-02 9.460e-02 0.590 0.555254
Distance_to_nationpalace -8.497e-02 2.153e-01 -0.395 0.693314
Distance_to_brokenchair 1.205e-02 1.896e-01 0.064 0.949354
averaged_score_competition -8.210e+00 4.014e+01 -0.205 0.838015
French -1.024e+01 1.681e+01 -0.609 0.542960
Italian 4.867e+00 1.824e+01 0.267 0.789688
European 6.199e+00 1.610e+01 0.385 0.700314
Vegetarian -1.647e+01 1.494e+01 -1.102 0.270779
Vegan -7.922e+00 1.698e+01 -0.466 0.641099
Mediterranean -2.044e+01 1.825e+01 -1.120 0.263282
Asian 4.956e+00 1.929e+01 0.257 0.797287
Gluten_free -8.758e+00 2.101e+01 -0.417 0.677007
Spanish 5.991e+00 3.866e+01 0.155 0.876913
Swiss 4.395e+00 2.076e+01 0.212 0.832388
Lunch 1.545e+01 2.408e+01 0.642 0.521343
Drinks -5.037e+01 1.435e+01 -3.509 0.000487 ***
Brunch 3.278e+01 2.324e+01 1.411 0.158934
Breakfast -3.708e+01 1.926e+01 -1.925 0.054797 .
Dinner -2.317e+01 2.476e+01 -0.936 0.349744
Late_Night_Drinks 5.161e+01 2.001e+01 2.579 0.010164 *
averaged_price 1.809e-03 4.086e-03 0.443 0.658077
OpenedHours 4.543e+00 1.379e+00 3.294 0.001053 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 143.3 on 539 degrees of freedom
Multiple R-squared: 0.6911, Adjusted R-squared: 0.6728
F-statistic: 37.69 on 32 and 539 DF, p-value: < 2.2e-16
We obtain a better RMSE and R-squared with k-fold cross-validation than with LOOCV. With 580 observations, we have a reasonably sized dataset for regression: 10-fold cross-validation strikes a good balance between leaving enough data in each fold for training and validation and still providing a reasonable estimate of model performance. The number of reviews ranges from 10 to 2,200, so there is substantial variability in the target variable; a higher k (such as 10) helps ensure that this entire range is represented in both training and validation sets across the folds. We then compared these results with our earlier models: both training models fitted with LOOCV and k-fold cross-validation show a lower adjusted R-squared, and our original linear model has a better RMSE. We conclude that our model overfits somewhat, because performance on the held-out data is worse than on the initial fit.
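The overfitting claim can be checked by placing the two error estimates side by side; a sketch, assuming `final_model` and the caret object `model_k` exist as above:

```r
# Compare in-sample RMSE of the linear model with the cross-validated
# RMSE stored by caret in the train object's results table
insample_rmse <- sqrt(mean(residuals(final_model)^2))
cv_rmse       <- model_k$results$RMSE
c(in_sample = insample_rmse, cross_validated = cv_rmse)
```

A cross-validated RMSE noticeably above the in-sample RMSE is the symptom of overfitting described above.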
4 Exploratory Analysis
4.1 Exploring Multiple Regression
4.1.1 Number of Reviews
Code
model3 <- lm(Number_of_reviews ~ rankingPosition + OpenedHours + averaged_score_competition * averaged_price + French + Italian + European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast*log(Distance_to_trainstation) + Late_Night_Drinks + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), data = Bigdata)
Code
summary(model3)
Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours +
averaged_score_competition * averaged_price + French + Italian +
European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free +
Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast *
log(Distance_to_trainstation) + Late_Night_Drinks + log(Distance_nearestparking) +
log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) +
log(Distance_to_nationpalace), data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-342.85 -97.43 -18.83 52.74 1823.91
Coefficients:
Estimate Std. Error t value
(Intercept) 1250.85230 350.83401 3.565
rankingPosition -0.38406 0.04811 -7.982
OpenedHours 6.85513 1.99492 3.436
averaged_score_competition -95.43806 65.60948 -1.455
averaged_price 1.60597 2.06135 0.779
French 9.08837 24.12488 0.377
Italian 7.19121 26.15266 0.275
European 38.50535 22.94541 1.678
Vegetarian 9.88746 21.05787 0.470
Vegan -7.08545 24.53101 -0.289
Mediterranean -26.21663 26.32632 -0.996
Asian -20.93725 27.64762 -0.757
Gluten_free 123.97509 28.95646 4.281
Spanish -42.99777 56.10955 -0.766
Swiss 17.32762 29.80252 0.581
Lunch 15.30256 34.49388 0.444
Dinner -1.15438 36.09453 -0.032
Drinks -61.05951 20.68942 -2.951
Brunch 75.73020 33.15977 2.284
Breakfast -409.66118 244.46837 -1.676
log(Distance_to_trainstation) 4.36497 17.68657 0.247
Late_Night_Drinks 69.11481 28.82954 2.397
log(Distance_nearestparking) -10.64925 13.22015 -0.806
log(Distance_neareststop) -17.27712 13.19577 -1.309
log(Distance_to_jet) 21.61045 22.18217 0.974
log(Distance_to_catedral) -56.14928 18.55688 -3.026
log(Distance_to_nationpalace) -38.50518 31.10007 -1.238
averaged_score_competition:averaged_price -0.37752 0.48504 -0.778
Breakfast:log(Distance_to_trainstation) 44.16263 35.80339 1.233
Pr(>|t|)
(Intercept) 0.000395 ***
rankingPosition 8.59e-15 ***
OpenedHours 0.000635 ***
averaged_score_competition 0.146347
averaged_price 0.436267
French 0.706527
Italian 0.783444
European 0.093897 .
Vegetarian 0.638874
Vegan 0.772817
Mediterranean 0.319775
Asian 0.449205
Gluten_free 2.20e-05 ***
Spanish 0.443820
Swiss 0.561202
Lunch 0.657486
Dinner 0.974498
Drinks 0.003302 **
Brunch 0.022769 *
Breakfast 0.094368 .
log(Distance_to_trainstation) 0.805159
Late_Night_Drinks 0.016850 *
log(Distance_nearestparking) 0.420865
log(Distance_neareststop) 0.190989
log(Distance_to_jet) 0.330378
log(Distance_to_catedral) 0.002597 **
log(Distance_to_nationpalace) 0.216212
averaged_score_competition:averaged_price 0.436722
Breakfast:log(Distance_to_trainstation) 0.217932
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 206.8 on 543 degrees of freedom
Multiple R-squared: 0.3517, Adjusted R-squared: 0.3182
F-statistic: 10.52 on 28 and 543 DF, p-value: < 2.2e-16
We let the average score of the competition interact with the average price of a restaurant. We also interacted Breakfast with the distance to the train station, assuming that restaurants close to the station see a higher flow of people in the morning.
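Since `model3` adds exactly these two interaction terms to `completemodel1`, the models are nested and an F-test can assess whether the interactions jointly add explanatory power. A sketch, assuming both models have been fitted as above:

```r
# Nested-model F-test: does adding the two interaction terms
# significantly reduce the residual sum of squares?
anova(completemodel1, model3)
```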
Code
bigmodel2 <- Bigdata %>%
lm(Number_of_reviews~ Distance_to_trainstation +Distance_nearestparking + Distance_neareststop + Distance_to_jet
+ Distance_to_catedral +Distance_to_patekmuseum +Distance_to_botanicgarden + Distance_to_nationpalace +
Distance_to_brokenchair, .)
Code
summary(bigmodel2)
Call:
lm(formula = Number_of_reviews ~ Distance_to_trainstation + Distance_nearestparking +
Distance_neareststop + Distance_to_jet + Distance_to_catedral +
Distance_to_patekmuseum + Distance_to_botanicgarden + Distance_to_nationpalace +
Distance_to_brokenchair, data = .)
Residuals:
Min 1Q Median 3Q Max
-259.76 -112.85 -54.05 26.71 2074.40
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 168.16440 126.98306 1.324 0.18594
Distance_to_trainstation -0.11278 0.06722 -1.678 0.09392 .
Distance_nearestparking 0.01925 0.05782 0.333 0.73932
Distance_neareststop -0.09056 0.07765 -1.166 0.24401
Distance_to_jet 0.11402 0.08072 1.413 0.15836
Distance_to_catedral -0.21738 0.07112 -3.057 0.00235 **
Distance_to_patekmuseum 0.14825 0.04888 3.033 0.00253 **
Distance_to_botanicgarden 0.18644 0.15451 1.207 0.22807
Distance_to_nationpalace -0.50554 0.34353 -1.472 0.14169
Distance_to_brokenchair 0.35832 0.30271 1.184 0.23702
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 245.8 on 562 degrees of freedom
Multiple R-squared: 0.05202, Adjusted R-squared: 0.03684
F-statistic: 3.427 on 9 and 562 DF, p-value: 0.0004039
When building our different models, we started with all the distance variables and saw that some of them were significant. However, as we added other variables step by step, the distance variables gradually lost their significance. We therefore wanted to fit a model based on them alone.
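One way to formalize the observation that the distance variables lose their explanatory power once other predictors enter is a nested-model F-test. A hedged sketch on simulated data (the names mirror the report's variables, but the data are synthetic):

```r
# Hedged sketch, NOT the report's code: compare a model without and
# with a distance term via anova()'s nested F-test.
set.seed(42)
n <- 200
sim <- data.frame(
  Number_of_reviews    = rpois(n, 100),
  Distance_to_catedral = runif(n, 100, 4000),
  rankingPosition      = sample(1:800, n, replace = TRUE)
)
small <- lm(Number_of_reviews ~ rankingPosition, data = sim)
full  <- lm(Number_of_reviews ~ rankingPosition + Distance_to_catedral, data = sim)
anova(small, full)  # F-test: does the distance term add explanatory power?
```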
Code
bigmodel2.2 <- Bigdata %>%
  lm(Number_of_reviews ~ French + Italian + European + Vegetarian + Vegan +
       Mediterranean + Asian + Gluten_free + Spanish + Swiss, .)
Code
summary(bigmodel2.2)
Call:
lm(formula = Number_of_reviews ~ French + Italian + European +
Vegetarian + Vegan + Mediterranean + Asian + Gluten_free +
Spanish + Swiss, data = .)
Residuals:
Min 1Q Median 3Q Max
-405.28 -91.24 -39.51 23.41 2200.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.034 20.574 2.967 0.00314 **
French 23.295 26.635 0.875 0.38216
Italian -1.905 28.660 -0.066 0.94703
European 69.330 24.848 2.790 0.00545 **
Vegetarian 54.524 22.844 2.387 0.01733 *
Vegan 19.976 26.909 0.742 0.45818
Mediterranean -18.783 29.082 -0.646 0.51862
Asian -38.975 29.513 -1.321 0.18717
Gluten_free 204.040 30.958 6.591 1.01e-10 ***
Spanish -50.739 61.589 -0.824 0.41039
Swiss 23.086 32.875 0.702 0.48283
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 231.6 on 561 degrees of freedom
Multiple R-squared: 0.1599, Adjusted R-squared: 0.1449
F-statistic: 10.68 on 10 and 561 DF, p-value: < 2.2e-16
Following the same logic, we then isolated each category of variables to assess its impact individually. In that way, we built the following model based on the cuisine variables.
Code
##With some interaction
bigmodel2.2 <- Bigdata %>%
lm(Number_of_reviews~ French + Italian*Dinner + European +
Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,.)
#summary(bigmodel2.2)
bigmodel2.3 <- Bigdata %>%
  lm(Number_of_reviews ~ Italian * Dinner + European * French +
       Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss, .)
#summary(bigmodel2.3)
Here we tried to include interactions to see whether a different impact would be observed: we interacted Italian with Dinner and European with French.
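With two candidate cuisine models in hand, AIC offers a quick way to compare them. A hedged sketch on simulated data (variable names mirror the report's, data are synthetic):

```r
# Hedged sketch, NOT the report's code: compare two interaction
# specifications with AIC on toy data.
set.seed(7)
n <- 150
d <- data.frame(
  Number_of_reviews = rpois(n, 90),
  Italian  = rbinom(n, 1, 0.3),
  Dinner   = rbinom(n, 1, 0.7),
  European = rbinom(n, 1, 0.4),
  French   = rbinom(n, 1, 0.3)
)
m_a <- lm(Number_of_reviews ~ Italian * Dinner + European + French, data = d)
m_b <- lm(Number_of_reviews ~ Italian * Dinner + European * French, data = d)
AIC(m_a, m_b)  # lower AIC = better fit/complexity trade-off
```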
Code
bigmodel4 <- Bigdata %>%
lm(Number_of_reviews ~rankingPosition + OpenedHours + averaged_score_competition+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),.)
summary(bigmodel4)
Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours +
averaged_score_competition + French + Italian + European +
Vegetarian + Vegan + Mediterranean + Asian + Gluten_free +
Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast +
Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) +
log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) +
log(Distance_to_nationpalace), data = .)
Residuals:
Min 1Q Median 3Q Max
-359.72 -96.22 -18.38 52.39 1814.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1366.90523 302.98678 4.511 7.89e-06 ***
rankingPosition -0.38380 0.04799 -7.997 7.67e-15 ***
OpenedHours 7.09796 1.98408 3.577 0.000378 ***
averaged_score_competition -126.30324 50.20428 -2.516 0.012163 *
French 11.09698 24.06812 0.461 0.644935
Italian 8.63323 26.11518 0.331 0.741086
European 34.98067 22.80353 1.534 0.125608
Vegetarian 9.16936 21.00376 0.437 0.662604
Vegan -5.36153 24.48033 -0.219 0.826721
Mediterranean -27.77958 26.27677 -1.057 0.290892
Asian -21.28139 27.52496 -0.773 0.439758
Gluten_free 124.95499 28.88586 4.326 1.81e-05 ***
Spanish -37.85744 55.52034 -0.682 0.495613
Swiss 15.40887 29.74699 0.518 0.604670
Lunch 15.00397 34.45116 0.436 0.663361
Dinner -9.65566 35.60394 -0.271 0.786343
Drinks -63.23221 20.61695 -3.067 0.002269 **
Brunch 73.02830 33.08638 2.207 0.027715 *
Breakfast -110.84424 27.43846 -4.040 6.12e-05 ***
Late_Night_Drinks 67.86674 28.76772 2.359 0.018669 *
log(Distance_to_trainstation) 11.21674 16.94930 0.662 0.508390
log(Distance_nearestparking) -11.47226 13.19184 -0.870 0.384875
log(Distance_neareststop) -16.99712 13.18428 -1.289 0.197876
log(Distance_to_jet) 22.34001 22.07307 1.012 0.311942
log(Distance_to_catedral) -56.49002 18.49767 -3.054 0.002369 **
log(Distance_to_nationpalace) -41.35904 31.01229 -1.334 0.182880
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 206.6 on 546 degrees of freedom
Multiple R-squared: 0.349, Adjusted R-squared: 0.3192
F-statistic: 11.71 on 25 and 546 DF, p-value: < 2.2e-16
4.1.2 Rating
As we saw in the general model containing all the variables, the distance variables have no impact on rating. In this fourth part of our report, we therefore omit them, since they would not bring any relevant information to the regression. Nevertheless, we decided to try a regression based on the different mealTypes.
Code
model4 <- lm(rating ~ rankingPosition + OpenedHours +
               averaged_score_competition * averaged_price + French + Italian +
               European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free +
               Spanish + Swiss + Lunch + Dinner + Drinks + Brunch +
               Breakfast * log(Distance_to_trainstation) + Late_Night_Drinks +
               log(Distance_nearestparking) + log(Distance_neareststop) +
               log(Distance_to_jet) + log(Distance_to_catedral) +
               log(Distance_to_nationpalace), Bigdata)
summary(model4)
Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition *
averaged_price + French + Italian + European + Vegetarian +
Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss +
Lunch + Dinner + Drinks + Brunch + Breakfast * log(Distance_to_trainstation) +
Late_Night_Drinks + log(Distance_nearestparking) + log(Distance_neareststop) +
log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace),
data = Bigdata)
Residuals:
Min 1Q Median 3Q Max
-0.99882 -0.24104 -0.02906 0.24420 1.05087
Coefficients:
Estimate Std. Error t value
(Intercept) 3.8158658 0.6161996 6.193
rankingPosition -0.0005889 0.0000845 -6.969
OpenedHours -0.0069183 0.0035038 -1.974
averaged_score_competition 0.1917746 0.1152355 1.664
averaged_price -0.0014835 0.0036205 -0.410
French 0.0204046 0.0423726 0.482
Italian 0.0643959 0.0459341 1.402
European -0.1414654 0.0403010 -3.510
Vegetarian -0.1852601 0.0369857 -5.009
Vegan -0.0181144 0.0430859 -0.420
Mediterranean -0.0739046 0.0462391 -1.598
Asian -0.0753480 0.0485599 -1.552
Gluten_free -0.0375155 0.0508587 -0.738
Spanish 0.1948705 0.0985500 1.977
Swiss 0.0288602 0.0523447 0.551
Lunch -0.1260464 0.0605845 -2.081
Dinner -0.0343219 0.0633959 -0.541
Drinks 0.0594782 0.0363386 1.637
Brunch 0.0670524 0.0582413 1.151
Breakfast -0.4471891 0.4293805 -1.041
log(Distance_to_trainstation) -0.0074833 0.0310644 -0.241
Late_Night_Drinks -0.0698023 0.0506358 -1.379
log(Distance_nearestparking) 0.0148760 0.0232197 0.641
log(Distance_neareststop) -0.0010721 0.0231768 -0.046
log(Distance_to_jet) -0.0289317 0.0389604 -0.743
log(Distance_to_catedral) 0.0581749 0.0325930 1.785
log(Distance_to_nationpalace) -0.0043696 0.0546237 -0.080
averaged_score_competition:averaged_price 0.0003543 0.0008519 0.416
Breakfast:log(Distance_to_trainstation) 0.0821178 0.0628845 1.306
Pr(>|t|)
(Intercept) 1.17e-09 ***
rankingPosition 9.26e-12 ***
OpenedHours 0.048834 *
averaged_score_competition 0.096650 .
averaged_price 0.682151
French 0.630318
Italian 0.161511
European 0.000485 ***
Vegetarian 7.41e-07 ***
Vegan 0.674342
Mediterranean 0.110555
Asian 0.121328
Gluten_free 0.461050
Spanish 0.048504 *
Swiss 0.581621
Lunch 0.037947 *
Dinner 0.588461
Drinks 0.102256
Brunch 0.250121
Breakfast 0.298119
log(Distance_to_trainstation) 0.809727
Late_Night_Drinks 0.168611
log(Distance_nearestparking) 0.522010
log(Distance_neareststop) 0.963121
log(Distance_to_jet) 0.458050
log(Distance_to_catedral) 0.074838 .
log(Distance_to_nationpalace) 0.936272
averaged_score_competition:averaged_price 0.677685
Breakfast:log(Distance_to_trainstation) 0.192156
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3632 on 543 degrees of freedom
Multiple R-squared: 0.1924, Adjusted R-squared: 0.1508
F-statistic: 4.621 on 28 and 543 DF, p-value: 3.769e-13
We wanted to test a possible interaction between the averaged score of the competition and the averaged price. In addition, we also considered an interaction between Italian and European because we had doubts about overlapping effects: European could encompass several specific cuisines such as French or Spanish. On the other hand, an interaction between Brunch and Breakfast, or with OpenedHours, could also be interesting.
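The suspected overlap between Italian and European can be probed directly with an interaction term. A hedged sketch on simulated data (the names mirror the report's dummies, the data are synthetic):

```r
# Hedged sketch, NOT the report's code: if `European` subsumes specific
# cuisines, the Italian:European coefficient captures the extra (or
# redundant) effect of carrying both labels.
set.seed(3)
n <- 120
d <- data.frame(
  rating   = pmin(5, pmax(1, rnorm(n, 4, 0.4))),
  Italian  = rbinom(n, 1, 0.3),
  European = rbinom(n, 1, 0.5)
)
overlap_fit <- lm(rating ~ Italian * European, data = d)
summary(overlap_fit)$coefficients  # look at the "Italian:European" row
```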
Code
model5 <- Bigdata %>%
  lm(rating ~ French + Italian + European + Vegetarian + Vegan +
       Mediterranean + Asian + Gluten_free + Spanish + Swiss, .)
Code
summary(model5)
Call:
lm(formula = rating ~ French + Italian + European + Vegetarian +
Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,
data = .)
Residuals:
Min 1Q Median 3Q Max
-1.23239 -0.25008 -0.09473 0.27848 0.89136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.390453 0.033986 129.185 < 2e-16 ***
French 0.033781 0.043999 0.768 0.44295
Italian 0.010058 0.047343 0.212 0.83184
European -0.113691 0.041046 -2.770 0.00579 **
Vegetarian -0.168123 0.037736 -4.455 1.01e-05 ***
Vegan 0.027755 0.044451 0.624 0.53262
Mediterranean -0.070264 0.048042 -1.463 0.14415
Asian -0.083995 0.048753 -1.723 0.08546 .
Gluten_free 0.039285 0.051141 0.768 0.44271
Spanish 0.190014 0.101740 1.868 0.06233 .
Swiss -0.005519 0.054306 -0.102 0.91909
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3826 on 561 degrees of freedom
Multiple R-squared: 0.07432, Adjusted R-squared: 0.05782
F-statistic: 4.504 on 10 and 561 DF, p-value: 3.835e-06
Code
model6 <- Bigdata %>%
  lm(rating ~ Lunch + Drinks + Brunch + Breakfast + Dinner + Late_Night_Drinks, .)
Code
summary(model6)
Call:
lm(formula = rating ~ Lunch + Drinks + Brunch + Breakfast + Dinner +
Late_Night_Drinks, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.1893 -0.2355 -0.1893 0.3104 0.8107
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.41123 0.07404 59.577 < 2e-16 ***
Lunch -0.15865 0.06086 -2.607 0.00938 **
Drinks 0.07394 0.03813 1.939 0.05300 .
Brunch 0.04590 0.06115 0.751 0.45321
Breakfast 0.04075 0.04815 0.846 0.39780
Dinner -0.06297 0.06490 -0.970 0.33234
Late_Night_Drinks -0.07430 0.05298 -1.402 0.16136
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3903 on 565 degrees of freedom
Multiple R-squared: 0.02985, Adjusted R-squared: 0.01955
F-statistic: 2.897 on 6 and 565 DF, p-value: 0.008639
5. Predictor
Here is the link to the predictor we built with a Shiny app:
6. Recommendations
Future work
Time Series Analysis: Use temporal data to analyze trends over time. This could reveal seasonal variations or long-term changes in restaurant popularity and customer preferences.
Customer Sentiment Analysis: Collect textual reviews and conduct sentiment analysis. This could provide insights into what customers particularly like or dislike about restaurants.
Competitive Analysis: Compare restaurants in Geneva with those in other cities or regions to identify unique trends or competitive advantages.
Economic Impact Analysis: Explore how changes in the restaurant industry (such as new openings, closures, and changes in ratings) correlate with economic indicators in Geneva.
Sustainability and Dietary Trends: Examine trends related to sustainability practices and the popularity of various dietary preferences, such as the growing number of restaurants offering vegan, gluten-free, and similar options.